In this paper we review studies of the growth of the Internet and technologies that 
are useful for information search and retrieval on the Web. We present data on the 
Internet from several different sources, e.g., current as well as projected numbers of 
users, hosts, and Web sites. Although numerical figures vary, overall trends cited 
by the sources are consistent and point to exponential growth in the past and in 
the coming decade. Hence it is not surprising that about 85% of Internet users 
surveyed claim to use search engines and search services to find specific 
information. The same surveys show, however, that users are not satisfied with the 
performance of the current generation of search engines; slow retrieval speed, 
communication delays, and poor quality of retrieved results (e.g., noise and broken 
links) are commonly cited problems. We discuss the development of new techniques 
targeted to resolve some of the problems associated with Web-based information 
retrieval, and speculate on future trends. 

Categories and Subject Descriptors: G.1.3 [Numerical Analysis]: Numerical 
Linear Algebra— Eigenvalues and eigenvectors (direct and iterative methods); 
Singular value decomposition; Sparse, structured and very large systems (direct and 
iterative methods); G.1.1 [Numerical Analysis]: Interpolation; H.3.1 
[Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 
[Information Storage and Retrieval]: Information Search and 
Retrieval — Clustering; Retrieval models; Search process; H.m [Information 
Systems]: Miscellaneous 

General Terms: Algorithms, Theory 

Additional Key Words and Phrases: Clustering, indexing, information retrieval, 
Internet, knowledge management, search engine, World Wide Web 



1. INTRODUCTION 

We review some notable studies on the 
growth of the Internet and on technolo- 
gies useful for information search and 
retrieval on the Web. Writing about the 
Web is a challenging task for several 
reasons, of which we mention three. 
First, its dynamic nature guarantees 
that at least some portions of any 
manuscript on the subject will be out-of-date 
before it reaches the intended audience, 
particularly URLs that are referenced. 
Second, a comprehensive 
coverage of all of the important topics is 
impossible, because so many new ideas 
are constantly being proposed and are 
either quickly accepted into the Internet 
mainstream or rejected. Finally, as with 
any review paper, there is a strong bias 
in presenting topics closely related to 
the authors' background, and giving 
only cursory treatment to those of which 
they are relatively ignorant. In an at- 
tempt to compensate for oversights and 
biases, references to relevant works 
that describe or review concepts in 
depth will be given whenever possible. 
This being said, we begin with refer- 
ences to several excellent books that 
cover a variety of topics in information 
management and retrieval. They in- 
clude Information Retrieval and Hyper- 
text [Agosti and Smeaton 1996]; Modern 
Information Retrieval [Baeza-Yates and 
Ribeiro-Neto 1999]; Text Retrieval and 
Filtering: Analytic Models of Perfor- 
mance [Losee 1998]; Natural Language 
Information Retrieval [Strzalkowski 
1999]; and Managing Gigabytes [Witten 
et al. 1994]. Some older, classic texts, 
which are slightly outdated, include In- 
formation Retrieval [Frakes and Baeza- 
Yates 1992]; Information Storage and 
Retrieval [Korfhage 1997]; Intelligent 
Multimedia Information Retrieval [Maybury 
1997]; Introduction to Modern 
Information Retrieval [Salton and McGill 
1983]; and Readings in Information Re- 
trieval [Jones and Willett 1977]. 

Additional references are to special 
journal issues on search engines on the 
Internet [Scientific American 1997]; 
digital libraries [CACM 1998]; digital 
libraries, representation and retrieval 
[IEEE 1996b]; the next generation 
graphical user interfaces (GUIs) [CACM 
1994]; Internet technologies [CACM 
1994; IEEE 1999]; and knowledge discovery 
[CACM 1999]. Some notable survey 
papers are those by Chakrabarti 
and Rajagopalan [1997]; Faloutsos and 
Oard [1995]; Feldman [1998]; Gudivada 
et al. [1997]; Leighton and Srivastava 
[1997]; Lawrence and Giles [1998b; 
1999b]; and Raghavan [1997]. Extensive, 
up-to-date coverage of topics in 
Web-based information retrieval and 
knowledge management can be found in 
the proceedings of several conferences, 
such as: the International World Wide 
Web Conferences [WWW Conferences 
2000] and the Association for Comput- 
ing Machinery's Special Interest Group 
on Computer-Human Interaction [ACM 
SIGCHI] and Special Interest Group on 
Information Retrieval [ACM SIGIR] 
conferences <acm.org>. Lists of papers 
and Web pages that review and compare 
Web search tools are maintained at several 
sites, including Boutell's World 
Wide Web FAQ <boutell.com/faq/>; 
Hamline University's 
<web.hamline.edu/administration/libraries/search/comparisons.html>; 
Kuhn's pages (in German) 
<gwdg.de/hkuhnl/pagesuch.html#vl>; 
Maire's pages (in French) <imaginet.fr/ime/search.htm>; 
Princeton University's 
<cs.princeton.edu/html/search.html>; 
U.C. Berkeley's <sunsite.berkeley.edu/help/searchdetails.html>; 
and Yahoo!'s pages on search engines 
<yahoo.com/computers_and_internet/internet/world_wide_web>. 
The historical development 
of information retrieval is documented 
in a number of sources: Baeza-Yates 
and Ribeiro-Neto [1999]; Cleverdon 
[1970]; Faloutsos and Oard [1995]; Sal- 
ton [1970]; and van Rijsbergen [1979]. 
Historical accounts of the Web and Web 
search technologies are given in Berners- 
Lee et al. [1994] and Schatz [1997]. 

This paper is organized as follows. In 
the remainder of this section, we dis- 
cuss and point to references on ratings 
of search engines and their features, the 
growth of information available on the 
Internet, and the growth in users. In 
the second section we present tools for 
Web-based information retrieval. These 
include classical retrieval tools (which 
can be used as is or with enhancements 
specifically geared for Web-based appli- 
cations), as well as a new generation of 
tools which have developed alongside 
the Internet. Challenges that must be 
overcome in developing and refining 
new and existing technologies for the 
Web environment are discussed. In the 
concluding section, we speculate on fu- 
ture directions in research related to 
Web-based information retrieval which 
may prove to be fruitful. 



1.1 Ratings of Search Engines and their 
Features 

About 85% of Web users surveyed claim 
to be using search engines or some kind 
of search tool to find specific informa- 
tion of interest. The list of publicly ac- 
cessible search engines has grown enor- 
mously in the past few years (see, e.g., 
blueangels.net), and there are now lists 
of top-ranked query terms available on- 
line (see, e.g., <searchterms.com>). 
Since advertising revenue for search 
and portal sites is strongly linked to the 
volume of access by the public, increas- 
ing hits (i.e., demand for a site) is an 
extremely serious business issue. Undoubtedly, 
this financial incentive is 
serving as one of the major impetuses for 
the tremendous amount of research on 
Web-based information retrieval. 

One of the keys to becoming a popular 
and successful search engine lies in the 
development of new algorithms specifi- 
cally designed for fast and accurate re- 
trieval of valuable information. Other 
features that make a search or portal 
site highly competitive are unusually 
attractive interfaces, free email addresses, 
and free access time [Chandrasekaran 
1998]. Quite often, these advantages 
last at most a few weeks, since 
competitors keep track of new develop- 
ments (see, e.g., <portalhub.com> or 
<traffik.com>, which gives updates and 
comparisons on portals). And sometimes 
success can lead to unexpected conse- 
quences: 



"Lycos, one of the biggest and most popular search 
engines, is legendary for its unavailability dur- 
ing work hours." [Webster and Paul 1996] 

There are many publicly available 
search engines, but users are not neces- 
sarily satisfied with the different for- 
mats for inputting queries, speeds of 
retrieval, presentation formats of the 
retrieval results, and quality of re- 
trieved information [Lawrence and 
Giles 1998b]. In particular, speed (i.e., 
search engine search and retrieval time 
plus communication delays) has consistently 
been cited as "the most commonly 
experienced problem with the Web" in 
the biannual WWW surveys conducted 
at the Graphics, Visualization, and Usability 
Center of the Georgia Institute 
of Technology (see footnote 1). In the past 
three surveys, which span a period of a 
year and a half, 63% to 66% of Web users 
were dissatisfied with the speed of retrieval and 
communication delay, and the problem 
appears to be growing worse. Even 
though 48% of the respondents in the 
April 1998 survey had upgraded mo- 
dems in the past year, 53% of the re- 
spondents left a Web site while search- 
ing for product information because of 
"slow access." "Broken links" registered 
as the second most frequent problem in 
the same survey. Other studies also cite 
the number one and number two reasons 
for dissatisfaction as "slow access" 
and "the inability to find relevant information," 
respectively [Huberman and 
Lukose 1997; Huberman et al. 1998]. In 
this paper we elaborate on some of the 
causes of these problems and outline 
some promising new approaches being 
developed to resolve them. 

It is important to remember that 
problems related to speed and access 
time may not be resolved by considering 
Web-based information access and retrieval 
as an isolated scientific problem. 
An August 1998 survey by Alexa Internet 
<alexa.com/company/inthenews/webfacts.html> 
indicates that 90% of all Web 
traffic is spread over 100,000 different 
hosts, with 50% of all Web traffic 
headed towards the top 900 most popular 
sites. Effective means of managing 
the uneven concentration of information 
packets on the Internet will be needed 
in addition to the development of fast 
access and retrieval algorithms. 

1 GVU's user survey (available at 
<gvu.gatech.edu/user_surveys/>) is one of the more reliable 
sources on user data. Its reports have been endorsed 
by the World Wide Web Consortium (W3C) 
and INRIA. 

The volume of information on search 
engines has exploded in the past year. 
Some valuable resources are cited be- 
low. The University of California at Ber- 
keley has extensive Web pages on "how 
to choose the search tools you need" 
<lib.berkeley.edu/teachinglib/guides/ 
internet/toolstables.html>. In addition 
to general advice on conducting 
searches on the Internet, the pages com- 
pare features such as size, case sensitiv- 
ity, ability to search for phrases and 
proper names, use of Boolean logic 
terms, ability to require or exclude spec- 
ified terms, inclusion of multilingual 
features, inclusion of special feature 
buttons (e.g., "more like this," "top 10 
most frequently visited sites on the subject," 
and "refine"), and exclusion of 
pages updated prior to a user-specified 
date, for several popular search engines 
such as those of AltaVista <altavista.com>; 
HotBot <hotbot.com>; Lycos Pro 
Power Search <lycos.com>; Excite <excite.com>; 
Yahoo! <yahoo.com>; Infoseek 
<infoseek.com>; Disinformation 
<disinfo.com>; and Northern Light 
<nlsearch.com>. 

The work of Lidsky and Kwon [1997] 
is an opinionated but informative re- 
source on search engines. It describes 
36 different search engines and rates 
them on specific details of their search 
capabilities. For instance, in one study, 
searches are divided into five catego- 
ries: (1) simple searches; (2) custom 
searches; (3) directory searches; (4) cur- 
rent news searches; and (5) Web con- 
tent. The five categories of search are 
evaluated in terms of power and ease of 
use. Ratings sometimes 
vary substantially for a given search 
engine. Similarly, query tests are conducted 
according to five criteria: (1) 
simple queries; (2) customized queries; 
(3) news queries; (4) duplicate elimination; 
and (5) dead link elimination. Once 
again, the ratings sometimes 
vary substantially for a given 
search engine. In addition to ratings, 
the authors give charts on search in- 
dexes and directories associated with 
twelve of the search engines, and rate 
them in terms of specific features for 
complex searches and content. The data 
indicate that as the number of people 
using the Internet and Web has grown, 
user types have diversified and search 
engine providers have begun to target 
more specific types of users and queries 
with specialized and tailored search 
tools. 

Web Search Engine Watch <search- 
enginewatch.com/webmasters/features. 
html> posts extensive data and ratings 
of popular search engines according to 
features such as size, pages crawled per 
day, freshness, and depth. Some other 
useful online sources are home pages on 
search engines by Gray <mit.people.edu/mkgray/net>; 
Information Today 
<infotoday.com/searcher^xin/story2.htm>; 
Kansas City Public Library 
<kcpl.lib.mo.us/search/srchengines.htm>; Koch 
<ub2.lu.se/desire/radar/lit-about-search-services.html>; 
Northwestern University Library 
<library.nwu.edu/resources/internet/search/evaluate.html>; and 
Notes of Search Engine Showdown 
<imt.net/notes/search/index.html>. Data 
on international use of the Web and 
Internet is posted at the NUA Internet 
Survey home page <nua.ie/surveys>. 

A note of caution: in digesting the 
data in the paragraphs above and below, 
bear in mind that published data on the Internet and 
the Web are very difficult to measure 
and verify. GVU offers a solid piece of 
advice on the matter: 

"We suggest that those interested in these (i.e., 
Internet /WWW statistics and demographics) 
statistics should consult several sources; these 
numbers can be difficult to measure and results 
may vary between different sources." [GVU's 
WWW user survey] 

Although details of data from different 
popular sources vary, overall trends are 
fairly consistently documented. We 
present some survey results from some 
of these sources below. 

1.2 Growth of the Internet and the Web 

Schatz [1997] of the National Center for 
Supercomputing Applications (NCSA) 
estimates that the number of Internet 
users increased from 1 million to 25 
million in the five years leading up to 
January of 1997. Strategy Alley [1998] 
gives a number of statistics on Internet 
users: Matrix Information and Direc- 
tory Services (MIDS), an Internet mea- 
surement organization, estimated there 
were 57 million users on the consumer 
Internet worldwide in April of 1998, and 
that the number would increase to 377 
million by 2000; Morgan Stanley gives 
the estimate of 150 million in 2000; and 
Killen and Associates give the estimate 
as 250 million in 2000. Nua's surveys 
<nua.ie/surveys> estimate the figure 
as 201 million worldwide in September 
of 1999, and more specifically by region: 
1.72 million in Africa; 33.61 million in the 
Asia/Pacific region; 47.15 million in Europe; 
0.88 million in the Middle East; 112.4 million 
in Canada and the U.S.; and 5.29 million in 
Latin America. 
Most data and projections support con- 
tinued tremendous growth (mostly ex- 
ponential) in Internet users, although 
precise numerical values differ. 

Most data on the amount of informa- 
tion on the Internet (i.e., volume, num- 
ber of publicly accessible Web pages and 
hosts) show tremendous growth, and 
the sizes and numbers appear to be 
growing at an exponential rate. Lynch 
has documented the explosive growth of 
Internet hosts; the number of hosts has 
been roughly doubling every year. For 
example, he estimates that it was 1.3 
million in January of 1993, 2.2 million 
in January of 1994, 4.9 million in January 
of 1995, and 9.5 million in January 
of 1996. His last set of data is 12.9 
million in July of 1996 [Lynch 1997]. 
Strategy Alley [1998] cites similar figures: 
"Since 1982, the number of hosts 
has doubled every year." An article 
by the editors of the IEEE Internet 
Computing Magazine states that exponential 
growth of Internet hosts was 
observed in separate studies by several 
experts [IEEE 1998a], such as Mark 
Lottor of Network Wizards <nw.com>; 
Mirjam Kühne of the RIPE Network 
Control Center <ripe.net> for a period 
of over ten years; Samarada Weerahandi 
of Bellcore on his home page on 
Internet hosts <ripe.net> for a period 
of over five years in Europe; and John 
Quarterman of Matrix Information and 
Directory Services <mids.org>. 
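
The "roughly doubling every year" claim can be checked directly against the host counts quoted from Lynch [1997]. The short Python sketch below is only an illustration; the function name is ours and is not taken from any of the cited studies.

```python
# Host counts (in millions) for January of each year, as quoted from Lynch [1997].
hosts_millions = {1993: 1.3, 1994: 2.2, 1995: 4.9, 1996: 9.5}


def yearly_growth_factors(counts):
    """Return {year: counts[year] / counts[year - 1]} for consecutive years."""
    years = sorted(counts)
    return {
        later: counts[later] / counts[earlier]
        for earlier, later in zip(years, years[1:])
        if later - earlier == 1
    }


factors = yearly_growth_factors(hosts_millions)
for year, factor in factors.items():
    print(f"{year - 1} -> {year}: growth factor {factor:.2f}")
```

The three year-over-year factors fall on both sides of 2, which is consistent with "roughly doubling" rather than an exact doubling law.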

The number of publicly accessible 
pages is also growing at an aggressive 
pace. Smith [1997] estimates that in 
January of 1997 there were 80 million 
public Web pages, and that the number 
would subsequently double annually. 
Bharat and Broder [1998] estimated 
that in November of 1997 the total num- 
ber of Web pages was over 200 million. 
If both of these estimates for number of 
Web pages are correct, then the rate of 
increase is higher than Smith's predic- 
tion, i.e., it would be more than double 
per year. In a separate estimate [Monier 
1998], the chief technical officer of Alta- 
Vista estimated that the volume of pub- 
licly accessible information on the Web 
has grown from 50 million pages on 
100,000 sites in 1995 to 100 to 150 
million pages on 600,000 sites in June 
of 1997. Lawrence and Giles summarize 
Web statistics published by others: 80 
million pages in January of 1997 by the 
Internet Archive [Cunningham 1997], 
75 million pages in September of 1997 
by Forrester Research Inc. [Guglielmo 
1997], Monier's estimate (mentioned 
above), and 175 million pages in Decem- 
ber 1997 by Wired Digital. Then they 
conducted their own experiments to es- 
timate the size of the Web and con- 
cluded that: 

"it appears that existing estimates significantly 
underestimate the size of the Web." [Lawrence 
and Giles 1998b] 
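
The comparison between Smith's estimate (80 million pages in January 1997) and that of Bharat and Broder (over 200 million pages in November 1997) can be made concrete by annualizing the growth between the two dates. The sketch below assumes pure exponential growth over the ten-month interval and treats both figures as exact, which the sources themselves caution against; the function name is illustrative.

```python
def annualized_growth_factor(start_count, end_count, months_elapsed):
    """Growth factor per 12 months, assuming exponential growth in between."""
    return (end_count / start_count) ** (12.0 / months_elapsed)


# 80 million pages (January 1997) to 200 million pages (November 1997).
factor = annualized_growth_factor(80e6, 200e6, months_elapsed=10)
print(f"implied yearly growth factor: {factor:.2f}")
```

Under these assumptions the implied yearly factor is close to 3, i.e., noticeably faster than the annual doubling Smith predicted.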

Follow-up studies by Lawrence and 
Giles [1999a] estimate that the number 
of publicly indexable pages on the Web 
at that time was about 800 million 
pages (with a total of 6 terabytes of text 
data) on about 3 million servers (Lawrence's 
home page: <neci.nec.com/lawrence/papers.html>). 
On Aug. 31, 1998, Alexa 
Internet announced its estimate of 3 
terabytes or 3 million megabytes for the 
amount of information on the Web, with 
20 million Web content areas; a content 
area is defined as top-level pages of 
sites, individual home pages, and signif- 
icant subsections of corporate Web sites. 
Furthermore, they estimate a doubling 
of volume every eight months. 
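
A doubling time of eight months is easier to compare with the yearly figures above when it is converted to a yearly growth factor. The following sketch does the conversion; the two-year projection is a purely hypothetical extrapolation of Alexa's 3-terabyte figure and is not a number reported by any source.

```python
def growth_over(months, doubling_months):
    """Growth factor after `months`, given a fixed doubling time in months."""
    return 2.0 ** (months / doubling_months)


annual_factor = growth_over(12, doubling_months=8)       # 2 ** 1.5
projected_tb = 3 * growth_over(24, doubling_months=8)    # hypothetical: 3 TB, two years on
print(f"yearly factor: {annual_factor:.2f}, two-year projection: {projected_tb:.0f} TB")
```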

Given the enormous volume of Web 
pages in existence, it comes as no sur- 
prise that Internet users are increas- 
ingly using search engines and search 
services to find specific information. According 
to Brin and Page, the World 
Wide Web Worm (home pages: <cs.colorado.edu/wwww> 
and <guano.cs.colorado.edu/wwww>) 
claims to have handled an 
average of 1,500 queries a day in April 
1994, and AltaVista claims to have han- 
dled 20 million queries in November 
1997. They believe that 

"it is likely that top search engines will handle 
hundreds of millions (of queries) per day by the 
year 2000" [Brin and Page 1998] 

The results of GVU's April 1998 
WWW user survey indicate that about 
86% of people now find a useful Web 
site through search engines, and 85% 
find them through hyperlinks in other 
Web pages; people now use search en- 
gines as much as surfing the Web to 
find information. 

1.3 Evaluation of Search Engines 

Several different measures have been 
proposed to quantitatively measure the 
performance of classical information retrieval 
systems (see, e.g., Losee [1998]; 
Manning and Schutze [1999]), most of 
which can be straightforwardly ex- 
tended to evaluate Web search engines. 
However, Web users may have a ten- 
dency to favor some performance issues 
more strongly than traditional users of 
information retrieval systems. For example, 
interactive response times appear 
to be at the top of the list of 
important issues for Web users (see Section 
1.1), as does the number of valuable 
sites listed in the first page of 
retrieved results (i.e., ranked in the top 
8, 10, or 12), so that the scroll-down or 
next-page button does not have to be invoked 
to view the most valuable results. 

Some traditional measures of infor- 
mation retrieval system performance 
are recognized in modified form by Web 
users. For example, a basic model from 
traditional retrieval systems recognizes 
a three way trade-off between the speed 
of information retrieval, precision, and 
recall (which is illustrated in Figure 1). 
This trade-off becomes increasingly difficult 
to balance as the numbers of documents 
and users of a database escalate. 
In the context of information retrieval, 
precision is defined as the ratio of rele- 
vant documents to the number of re- 
trieved documents: 

\[
\text{precision} = \frac{\text{number of relevant documents}}{\text{number of retrieved documents}},
\]

and recall is defined as the proportion of relevant documents that are retrieved. 
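
As a small worked illustration of these two measures, the sketch below computes precision and recall for one query. It assumes that documents are identified by opaque IDs and that the full set of relevant documents is known, which is rarely the case on the Web; the numerator of precision is taken to be the relevant documents among those retrieved. Function and variable names are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document IDs returned by the system
    relevant:  iterable of document IDs judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, two of them relevant; five relevant documents exist.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7", "d9", "d10"])
print(p, r)
```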
Most Web users who utilize search engines 
are not so much interested in the 
traditional measure of precision as the 
precision of the results displayed in the 
first page of the list of retrieved docu- 
ments, before a "scroll" or "next page" 
command is used. Since there is little 
hope of actually measuring the recall 
rate for each Web search engine query 
and retrieval job— and in many cases 
there may be too many relevant pag- 
es—a Web user would tend to be more 
concerned about retrieving and being 
able to identify only very highly valu- 
able pages. Kleinberg [1998] recognizes 
the importance of finding the most 
information-rich, or authority, pages. Hub 
pages, i.e., pages that have links to 
many authority pages, are also recognized 
as being very valuable. A Web 
user might substitute recall with a modified 
version in which the recall is computed 
with respect to the set of hub and 
authority pages retrieved in the top 10 
or 20 ranked documents (rather than all 
related pages). Details of an algorithm 
for retrieving authorities and hubs by 
Kleinberg [1998] are given in Section 2.4 
of this paper. 
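
The Web-oriented variants described above can be sketched as follows: precision restricted to the first page of results, and a modified recall computed only against a set of highly valuable (hub and authority) pages within the top-ranked results. This is a minimal sketch under our own assumptions; the cutoff k, the function names, and the way the set of valuable pages is obtained are all hypothetical.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked results that are relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    relevant = set(relevant)
    return sum(1 for doc in top if doc in relevant) / len(top)


def top_k_recall(ranked, valuable, k=10):
    """Fraction of the 'valuable' (hub or authority) pages that appear in the top k."""
    valuable = set(valuable)
    if not valuable:
        return 0.0
    return len(set(ranked[:k]) & valuable) / len(valuable)


ranked = ["a", "b", "c", "d", "e", "f"]          # results in ranked order
first_page_precision = precision_at_k(ranked, {"a", "c", "f"}, k=4)
modified_recall = top_k_recall(ranked, {"a", "c", "f", "z"}, k=4)
print(first_page_precision, modified_recall)
```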

Hearst [1999] notes that the user in- 
terface, i.e., the quality of human-com- 
puter interaction, should be taken into 
account when evaluating an informa- 
tion retrieval system. Nielsen [1993] ad- 
vocates the use of qualitative (rather 
than quantitative) measures to evaluate 
information retrieval systems. In partic- 
ular, user satisfaction with the system 
interface as well as satisfaction with 
retrieved results as a whole (rather 
than statistical measures) is suggested. 
Westera [1996] suggests some query for- 
mats for benchmarking search engines, 
such as: single keyword search; plural 
search capability; phrase search; Boolean 
search (with proper noun); and complex 
Boolean. In the next section we 
discuss some of the differences and sim- 
ilarities in classical and Internet-based 
search, access and retrieval of informa- 
tion. 

Hawking et al. [1999] discuss evaluation 
studies of six search engines from the 
Text Retrieval Conferences (TREC) of the 
U.S. National Institute 
of Standards and Technology (NIST) 
<trec.nist.gov>. In particular, 
they examine answers to questions 
such as "Can link information result 
in better rankings?" and "Do longer 
queries result in better answers?" 

2. TOOLS FOR WEB-BASED RETRIEVAL 
AND RANKING 

Classical retrieval and ranking algorithms 
developed for isolated (and 
sometimes static) databases are not necessarily 
suitable for Internet applications. 
Two of the major differences between 
classical and Web-based retrieval 
and ranking problems, and hence two of 
the major challenges in 
developing solutions, are the number of 
simultaneous users of popular search 
engines and the number of documents 
that can be accessed and ranked. More 
specifically, the number of simultaneous 
users of a search engine at a given 
moment cannot be predicted beforehand 
and may overload a system. And the 
number of publicly accessible docu- 
ments on the Internet exceeds those 
numbers associated with classical data- 
bases by several orders of magnitude. 
Furthermore, the number of Internet 
search engine providers, Web users, and 
Web pages is growing at a tremendous 
pace, with each average page occupying 
more memory space and containing dif- 
ferent types of multimedia information 
such as images, graphics, audio, and 
video. 

There are other properties besides the 
number of users and size that set classi- 
cal and Web-based retrieval problems 
apart. If we consider the set of all Web 
pages as a gigantic database, this set is 
very different from a classical database 
with elements that can be organized, 
stored, and indexed in a manner that 
facilitates fast and accurate retrieval 
using a well-defined format for input 
queries. In Web-based retrieval, deter- 
mining which pages are valuable 
enough to index, weight, or cluster and 
carrying out the tasks efficiently, while 
maintaining a reasonable degree of 
accuracy considering the ephemeral nature 
of the Web, is an enormous challenge. 
Further complicating the problem 
is the set of appropriate input queries; 
the best format for inputting the que- 
ries is not fixed or known. In this sec- 
tion we examine indexing, clustering, 
and ranking algorithms for documents 
available on the Web and user interfaces 
for prototype IR systems for the 
Web. 

2.1 Indexing 

The American Heritage Dictionary 
(1976) defines index as follows: 

(in·dex) 1. Anything that serves to 
guide, point out or otherwise facilitate 
reference, as: a. An alphabetized listing 
of names, places, and subjects included 
in a printed work that gives 
for each item the page on which it 
may be found. b. A series of notches 
cut into the edge of a book for easy 
access to chapters or other divisions. 
c. Any table, file, or catalogue. 

Although the term is used in the same 
spirit in the context of retrieval and 
ranking, it has a specific meaning. 
Some definitions proposed by experts 
are "The most important of the tools for 
information retrieval is the index— a 
collection of terms with pointers to 
places where information about docu- 
ments can be found" [Manber 1999]; 
"indexing is building a data structure 
that will allow quick searching of the 
text" [Baeza-Yates 1999]; or "the act of 
assigning index terms to documents, 
which are the objects to be retrieved" 
[Korfhage 1997]; "An index term is a 
(document) word whose semantics helps 
in remembering the document's main 
themes" [Baeza-Yates and Ribeiro-Neto 
1999]. Four approaches to indexing doc- 
uments on the Web are (1) human or 
manual indexing; (2) automatic index- 
ing; (3) intelligent or agent-based index- 
ing; and (4) metadata, RDF, and anno- 
tation-based indexing. The first two 
appear in many classical texts, while 
the latter two are relatively new and 
promising areas of study. We first give 
an overview of Web-based indexing, 
then describe or give references to the 
various approaches. 
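
The definitions quoted above ("a collection of terms with pointers to places where information about documents can be found") correspond most directly to an inverted index. The sketch below is a minimal, assumption-laden illustration of automatic indexing: tokenization is a simple lowercase alphanumeric split, every word is treated as an index term, and the page IDs and texts are invented for the example. Real Web indexers add stop-word removal, stemming, term weighting, and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return IDs of documents containing every term of the query (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical three-page collection.
pages = {
    "p1": "Web search engines index pages",
    "p2": "Manual indexing of Web pages",
    "p3": "Search and retrieval",
}
index = build_inverted_index(pages)
print(search_all_terms(index, "web pages"))
```

Looking up a term is a single dictionary access, which is what makes the structure attractive for fast retrieval; the cost is paid at indexing time and whenever pages change.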

Indexing Web pages to facilitate re- 
trieval is a much more complex and 
challenging problem than the corre- 
sponding one associated with classical 
databases. The enormous number of ex- 
isting Web pages and their rapid in- 
crease and frequent updating makes 
straightforward indexing, whether by 
human or computer-assisted means, a 
seemingly impossible, Sisyphean task. 
Indeed, most experts agree that, at a 
given moment, a significant portion of 
the Web is not recorded by the indexer 
of any search engine. Lawrence and 
Giles estimated that, in April 1997, the 
lower bound on indexable Web pages 
was 320 million, and a given individual 
search engine will have indexed between 
3% and 34% of the possible total 
[Lawrence and Giles 1998b]. They also 
estimated that the extent of overlap 
among the top six search engines is 
small and their collective coverage was 
only around 60%; the six search engines 
are HotBot, AltaVista, Northern Light, 
Excite, Infoseek, and Lycos. A follow up 
study for the period February 2-28, 1999, 
involving the top 11 search engines (the six 
above plus Snap <snap.com>; Microsoft 
<msn.com>; Google <google.com>; 
Yahoo!; and Euroseek <euroseek.com>) 
indicates that we are losing the index- 
ing race. A far smaller proportion of the 
Web is now indexed with no engine cov- 
ering more than 16% of the Web. Index- 
ing appears to have become more impor- 
tant than ever, since 83% of sites 
contained commercial content and 6% 
contained scientific or educational con- 
tent [Lawrence and Giles 1999a]. 

Bharat and Broder estimated in November 
1997 that the numbers of pages 
indexed by HotBot, AltaVista, Excite, 
and Infoseek were 77 million, 100 million, 
32 million, and 17 million, respectively. 
Furthermore, they believe that 
the union of these pages is around 160 
million pages, i.e., about 80% of the 200 
million total accessible pages they believe 
existed at that time. Their studies indicate 
that there is little overlap in the 
indexing coverage; more specifically, 
less than 1.4% (i.e., 2.2 million) of the 
160 million indexed pages were covered 
by all four of the search engines. Melee's 
Indexing Coverage Analysis (MICA) 
Reports <melee.com/mica/index.html> 
provide a weekly update on indexing 
coverage and quality by a few select 
search engines that claim to index "at 
least one fifth of the Web." Other studies 
on estimating the extent of Web 
pages that have been indexed by popu- 
lar search engines include Baldonado 
and Winograd [1997]; Hernandez 
[1996]; Hernandez and Stolfo [1995]; 
Hylton [1996]; Monge and Elkan [1998]; 
Selberg and Etzioni [1995a]; and Silber- 
schatz et al. [1995]. 
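
The coverage and overlap percentages quoted from Bharat and Broder follow from simple set arithmetic. The sketch below first shows the computation on toy data and then rechecks the quoted figures. It assumes each engine's index is available as a set of page identifiers, which is not how the cited studies worked (they had to estimate these quantities from outside the engines); the engine names and page IDs are invented.

```python
def coverage_report(engine_indexes, estimated_web_size):
    """Summarize joint coverage and common overlap of several indexes.

    engine_indexes: non-empty dict mapping engine name -> set of indexed page IDs
    estimated_web_size: estimated number of indexable pages
    """
    index_sets = [set(pages) for pages in engine_indexes.values()]
    union = set().union(*index_sets)          # pages indexed by at least one engine
    common = set.intersection(*index_sets)    # pages indexed by every engine
    return {
        "union_coverage": len(union) / estimated_web_size,
        "overlap_within_union": len(common) / len(union) if union else 0.0,
    }


# Toy data: three hypothetical engines over a "Web" of 10 pages.
toy = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {4, 6}}
report = coverage_report(toy, estimated_web_size=10)
print(report)

# Recheck of the quoted figures (in millions of pages).
quoted_union_coverage = 160 / 200   # about 80% of the estimated Web
quoted_overlap = 2.2 / 160          # just under 1.4% of the indexed pages
```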

In addition to the sheer volume of 
documents to be processed, indexers 
must take into account other complex 
issues. For example, Web pages are not 
constructed in a fixed format; the textual 
data is riddled with an unusually 
high percentage of typos; the contents 
usually contain nontextual multimedia 
data; and updates to the pages are 
made at different rates. For instance, 
preliminary studies documented in Na- 
varro [1998] indicate that on the aver- 
age site 1 in 200 common words and 1 in 
3 foreign surnames are misspelled. 
Brake [1997] estimates that the average 
page of text remains unchanged on the 
Web for about 75 days, and Kahle esti- 
mates that 40% of the Web changes 
every month. Multiple copies of identical 
or near-identical pages are abundant; 
for example, FAQ postings (see footnote 2), mirror 
sites, old and updated versions of 
news, and newspaper sites. Broder et al. 
[1997] and Shivakumar and García-Molina 
[1998] estimate that 30% of Web 
pages are duplicates or near-duplicates. 

2 FAQs, or frequently asked questions, are essays 
on topics on a wide range of interests, with pointers 
and references. For an extensive list of FAQs, 
see <cis.ohio-state.edu/hypertext/faq/usenet/faq-list.html> 
and <faq.org>. 

Tools for removing redundant URLs or 
URLs of near and perfectly identical 
sites have been investigated by Baldonado 
and Winograd [1997]; Hernandez 
[1996]; Hernandez and Stolfo [1995]; 
Hylton [1996]; Monge and Elkan [1998]; 
Selberg and Etzioni [1995a]; and Silberschatz 
et al. [1995]. 

Henzinger et al. [1999] suggested a 
method for evaluating the quality of 
pages in a search engine's index. In the 
past, the volume of pages indexed was 
used as the primary measurement of 
Web page indexers. Henzinger et al. 
suggest that the quality of the pages in 
a search engine's index should also be 
considered, especially since it has be- 
come clear that no search engine can 
index all documents on the Web, and 
there is very little overlap between the 
indexed pages of major search engines. 
The idea of Henzinger's method is to
evaluate the quality of a Web page
according to its indegree (an evaluation
measure based on how many other 
pages point to the Web page under con- 
sideration [Carriere and Kazman 1997]) 
and PageRank (an evaluation measure 
based on how many other pages point to 
the Web page under consideration, as 
well as the value of the pages pointing 
to it [Brin and Page 1998; Cho et al. 
1998]). 
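
To make the two measures concrete, the following minimal sketch computes
indegree and a PageRank-style score on a small link graph. The graph
`toy_web`, the function names, the damping factor of 0.85, and the fixed
number of iterations are illustrative assumptions; they are not details
taken from Henzinger et al. [1999] or Brin and Page [1998].

```python
def indegree(links):
    """Count how many distinct pages point to each page."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Score pages by power iteration: a page is valuable when
    valuable pages point to it. The damping value is an assumption."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = set(links.get(p, ()))
            if targets:
                # A page passes its rank on in equal shares to its outlinks.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: page -> pages it links to.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(indegree(toy_web))
print(pagerank(toy_web))
```

In this toy graph, pages "a" and "b" have the same indegree, yet "a"
receives a higher PageRank-style score because the page pointing to it
is itself highly ranked. This is the distinction between the two
measures drawn above.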

The development of effective indexing 
tools to aid in filtering is another major 
class of problems associated with Web- 
based search and retrieval. Removal of 
spurious information is a particularly 
challenging problem, since a popular in- 
formation site (e.g., newsgroup discus- 
sions, FAQ postings) will have little 
value to users with no interest in the 
topic. Filtering to block pornographic
materials from children or for
censorship of culturally offensive
materials is another important area for
research and business development. One
of the promising new approaches is the
use of metadata, i.e., summaries of Web
page content or sites placed in the page
for aiding automatic indexers.

2.1.1 Classical Methods. Manual
indexing is currently used by several
commercial, Web-based search engines, e.g.,
Galaxy <galaxy.einet.net>; GNN: Whole 
Internet Catalog <elc.gnn.com/gnn/wic/ 
index.html>; Infomine <lib-www.ucr. 
edu>; KidsClick! <sunsite.berkeley.edu/ 
kidsclick!/>; LookSmart <looksmart.com>;
Subject Tree <bubl.bath.ac.uk/bubl/ 
cattree.html>; Web Developer's Virtual 
Library <stars.com>; World-Wide Web 
Virtual Library Series Subject Catalog 
<w3.org/hypertext/datasources/bysubject/ 
overview.html>; and Yahoo!. The
practice is unlikely to remain as
successful over the next few years: as
the volume of information available
over the Internet increases at an ever
greater pace, manual indexing is likely
to become obsolete over the long term.
Another major drawback
with manual indexing is the lack of 
consistency among different profes- 
sional indexers; as few as 20% of the 
terms to be indexed may be handled in 
the same manner by different individu- 
als [Korfhage 1997, p. 107], and there is 
noticeable inconsistency, even by a 
given individual [Borko 1979; Cooper 
1969; Jacoby and Slamecka 1962;
Macskassy et al. 1998; Preschel 1972; and
Salton 1969]. 

Though not perfect, compared to most 
automatic indexers, human indexing is 
currently the most accurate because ex- 
perts on popular subjects organize and 
compile the directories and indexes in a 
way which (they believe) facilitates the 
search process. Notable references on 
conventional indexing methods, includ- 
ing automatic indexers, are Part IV of 
Soergel [1985]; Jones and Willett 
[1977]; van Rijsbergen [1977]; and
Witten et al. [1994, Chap. 3].
Technological advances are expected to
narrow the gap in indexing quality
between human and machine-generated
indexes. In the
future, human indexing will only be ap- 
plied to relatively small and static (or 
near static) or highly specialized data 
bases, e.g., internal corporate Web 
pages. 



2.1.2 Crawlers/Robots. Scientists have
recently been investigating the use of
intelligent agents for performing specific
tasks, such as indexing on the Web [AI
Magazine 1997; Baeza-Yates and
Ribeiro-Neto 1999]. There is some
ambiguity concerning proper terminology to
describe these agents. They are most 
commonly referred to as crawlers, but 
are also known as ants, automatic in- 
dexers, bots, spiders, Web robots (Web 
robot FAQ <info.webcrawler.com/mak/ 
projects/robots/faq.html>), and worms. 
It appears that some of the terms were 
proposed by the inventors of a specific 
tool, and their subsequent use spread to 
more general applications of the same 
genre. 

Many search engines rely on automati- 
cally generated indices, either by them- 
selves or in combination with other 
technologies, e.g., Aliweb <nexor.co.uk/ 
public/aliweb/aliweb.html>; AltaVista; 
Excite; Harvest <harvest.transarc.com>; 
HotBot; Infoseek; Lycos; Magellan
<magellan.com>; MerzScope
<merzcom.com>; Northern Light; Smart Spider
<engsoftware.com>; Webcrawler 
<webcrawler.com/>; and World Wide 
Web Worm. Although most of Yahoo!'s
entries are indexed by humans or ac- 
quired through submissions, it uses a 
robot to a limited extent to look for new 
announcements. Examples of highly 
specialized crawlers include Argos <argos. 
evansville.edu> for Web sites on the 
ancient and medieval worlds; CACTVS 
Chemistry Spider <schiele.organik. 
uni-erlangen.de/cactvs/spider.html> for 
chemical databases; MathSearch <maths. 
usyd.edu.au:8000/mathsearch.html> for 
English mathematics and statistics doc- 
uments; NEC-MeshExplorer <netplaza. 
biglobe.or.jp/keyword.html> for the 
NETPLAZA search service owned by
the NEC Corporation; and Social
Science Information Gateway (SOSIG)
<scout.cs.wisc.edu/scout/mirrors/sosig> 
for resources in the social sciences. 
Crawlers that index documents in lim- 
ited environments include LookSmart 
<looksmart.com/> for a 300,000 site data- 
base of rated and reviewed sites; Robbie
the Robot, funded by DARPA for
education and training purposes; and UCSD
Crawl <www.mib.org/ucsdcrawl> for
UCSD pages. More extensive lists of in- 
telligent agents are available on The Web 
Robots Page <info.webcrawler.com/mak/ 
projects/robots/active/html/type.html>;
and on Washington State University's 
robot pages <wsulibs.wsu.edu/general/ 
robots.htm>. 

To date, there are three major prob- 
lems associated with the use of robots: 
(1) some people fear that these agents 
are too invasive; (2) robots can overload 
system servers and cause systems to be 
virtually frozen; and (3) some sites are 
updated at least several times per day, 
e.g., approximately every 20 minutes by 
CNN <cnn.com> and Bloomberg 
<bloomberg.com>, and every few hours 
by many newspaper sites [Carl 1995] 
(article home page <info.webcrawler.com/
mak/projects/robots/threat-or-treat.html>);
[Koster 1995]. Some Web sites
deliberately keep out spiders; for example,
the New York Times <nytimes.com>
requires users to pay and fill out a 
registration form; CNN used to exclude 
search spiders to prevent distortion of 
data on the number of users who visit 
the site; and the online catalogue of the 
British Library <portico.bl.uk> only al- 
lows access to users who have filled out 
an online query form [Brake 1997]. Sys- 
tem managers of these sites must keep 
up with the new spider and robot tech- 
nologies in order to develop their own 
tools to protect their sites from new 
types of agents that intentionally or un- 
intentionally could cause mayhem. 

As a working compromise, Koster
has proposed a robots exclusion
standard ("A standard for robots exclusion,"
ver. 1: <info.webcrawler.com/mak/projects/
robots/exclusion.html>; ver. 2: <info.
webcrawler.com/mak/projects/robots/
norobot.html>), which advocates
blocking certain types of searches to
relieve overload problems. He has also
proposed guidelines for robot design
("Guidelines for robot writers" (1993)
<info.webcrawler.com/mak/projects/robots/
guidelines.html>). It is important to
note that robots are not always the root
cause of network overload; sometimes 
human user overload causes problems, 
which is what happened at the CNN 
site just after the announcement of the 
O.J. Simpson trial verdict [Carl 1995]. 
Use of the exclusion standard is strictly 
voluntary, so that Web masters have no 
guarantee that robots will not be able to 
enter computer systems and create 
havoc. Arguments in support of the ex- 
clusion standard and discussion on its 
effectiveness are given in Carl [1995] 
and Koster [1996]. 
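
As a hedged illustration of how a well-behaved robot can honor the
exclusion standard, the sketch below uses Python's standard
`urllib.robotparser` module to decide whether a given agent may fetch a
URL. The exclusion file contents, the agent names "politebot" and
"greedybot", and the example.com URLs are hypothetical. Because
compliance is voluntary, the check only helps when the robot's author
chooses to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion file, as a site might serve it at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: greedybot
Disallow: /
"""


def make_checker(robots_txt):
    """Parse the text of an exclusion file into a reusable checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)

# A cooperative robot asks before every fetch.
for agent, url in [
    ("politebot", "http://example.com/index.html"),
    ("politebot", "http://example.com/private/data.html"),
    ("greedybot", "http://example.com/index.html"),
]:
    print(agent, url, checker.can_fetch(agent, url))
```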

2.1.3 Metadata, RDF, and Annotations.

"What is metadata? The Macquarie dictionary de- 
fines the prefix 'meta-' as meaning 'among,'
'together with,' 'after' or 'behind.' That suggests
the idea of a 'fellow traveller'; that metadata is
not fully fledged data, but it is a kind of
fellow-traveller with data, supporting it from the
sidelines. My definition is that 'an element of
metadata describes an information resource or helps
provide access to an information resource.'"
[Cathro 1997]

In the context of Web pages on the 
Internet, the term "metadata" usually 
refers to an invisible file attached to a 
Web page that facilitates collection of 
information by automatic indexers; the 
file is invisible in the sense that it has
no effect on the visual appearance of the 
page when viewed using a standard 
Web browser. 

The World Wide Web (W3) Consor- 
tium <w3.org> has compiled a list of 
resources on information and standard- 
ization proposals for metadata (W3 
metadata page <w3.org/metadata>). A
number of metadata standards have 
been proposed for Web pages. Among 
them, two well-publicized, solid efforts 
are the Dublin Core Metadata standard:
home page <purl.oclc.org/metadata/
dublin_core> and the Warwick
framework: article home page <dlib.org/dlib/
july96/lagoze/07lagoze.html> [Lagoze
1996]. The Dublin Core is a 15-element 
metadata element set proposed to facili- 
tate fast and accurate information re- 
trieval on the Internet. The elements 
are title; creator; subject; description;
publisher; contributors; date; resource
type; format; resource identifier; source; 
language; relation; coverage; and rights. 
The group has also developed methods 
for incorporating the metadata into a 
Web page file. Other resources on meta- 
data include Chapter 6 of Baeza-Yates
and Ribeiro-Neto [1999] and Mar- 
chionini [1999]. If the general public 
adopts and increases use of a simple 
metadata standard (such as the Dublin 
Core), the precision of information re- 
trieved by search engines is expected to 
improve substantially. However, wide- 
spread adoption of a standard by inter- 
national users is dubious. 
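
One way to picture how such an element set could be incorporated into a
Web page is sketched below: a small helper that renders a record as
HTML `<meta>` tags. The element names follow the list given above; the
"DC." name prefix, the replacement of spaces by underscores, and the
sample record are assumptions made for illustration and should not be
read as the standard's official encoding.

```python
from html import escape

# The fifteen elements, named as in the list above.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributors", "date", "resource type", "format",
    "resource identifier", "source", "language", "relation",
    "coverage", "rights",
)


def dublin_core_meta(record):
    """Render a {element: value} record as one <meta> tag per line."""
    unknown = set(record) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise ValueError("not in the element set: %s" % sorted(unknown))
    lines = []
    for element in DUBLIN_CORE_ELEMENTS:
        if element in record:
            # Tag naming scheme is an assumption for this sketch.
            name = "DC." + element.replace(" ", "_")
            value = escape(record[element], quote=True)
            lines.append('<meta name="%s" content="%s">' % (name, value))
    return "\n".join(lines)


# Hypothetical record for a single page.
sample = {"title": "Indexing & Retrieval", "language": "en"}
print(dublin_core_meta(sample))
```

Because the tags live in the page header, they do not change what a
reader sees in a browser, which matches the "invisible" character of
metadata described earlier.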

One of the major drawbacks of the 
simplest type of metadata for labeling 
HTML documents, called metatags, is 
they can only be used to describe con- 
tents of the document to which they are 
attached, so that managing collections 
of documents (e.g., directories or those 
on similar topics) may be tedious when 
updates to the entire collection are 
made. Since a single command cannot 
be used to update the entire collection 
at once, documents must be updated 
one-by-one. Another problem arises when
documents from two or more different
collections are merged to form a new
collection: inconsistent use of metatags
may lead to confusion, since a metatag
might be used in different collections
with entirely different meanings. To
resolve these issues, the W3
Consortium proposed in May 1999 that 
the Resource Description Framework 
(RDF) be used as the metadata coding 
scheme for Web documents (W3
Consortium RDF homepage <w3.org/rdf>). An
interesting associated development is
IBM's XCentral <ibm.com/developer/
xml>, the first search engine that
indexes XML and RDF elements.

Metadata places the responsibility of 
aiding indexers on the Web page au- 
thor, which is reasonable if the author 
is a responsible person wishing to ad- 
vertise the presence of a page to in- 
crease legitimate traffic to a site. Unfor- 
tunately, not all Web page authors are 



fair players. Many unfair players
maintain sites that can increase
advertising revenue if the number of
visitors is very high, or that charge a
fee per visit for access to
pornographic, violent, and culturally
offensive materials. These sites
can attract a large volume of visitors by 
attaching metadata with many popular 
keywords. Development of reliable fil- 
tering services for parents concerned 
about their children's surfing venues is 
a serious and challenging problem. 

Spamming, i.e., excessive, repeated 
use of key words or "hidden" text pur- 
posely inserted into a Web page to pro- 
mote retrieval by search engines, is re- 
lated to, but separate from, the 
unethical or deceptive use of metadata. 
Spamming is a new phenomenon that 
appeared with the introduction of 
search engines, automatic indexers, and 
filters on the Web [Flynn 1996; Libera- 
tore 1997]. Its primary intent is to out- 
smart these automated software sys- 
tems for a variety of purposes; 
spamming has been used as an adver- 
tising tool by entrepreneurs, cult re- 
cruiters, egocentric Web page authors 
wanting attention, and technically well- 
versed, but unbalanced, individuals who 
have the same sort of warped mentality 
as inventors of computer viruses. A
famous example of hidden text spamming,
a technique known as font color
spamming [Liberatore 1997], is the
embedding of words in a black background
by the Heaven's Gate cult. Although the
cult no longer exists, its home page is
archived at the sunspot.net site
<sunspot.net/news/special/
heavensgatesite>. We note that the term
spamming
has a broader meaning, related to re- 
ceiving an excessive amount of email or 
information. An excellent, broad over- 
view of the subject is given in Cranor 
and LaMacchia [1998]. In our context,
the specialized terms spam-indexing, 
spam-dexing, or keyword spamming are 
more precise. 
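
A crude signal of keyword spamming is how much of a page's text is
taken up by its single most repeated word. The sketch below computes
that ratio. It is a simple heuristic of our own for illustration, not a
method from the works cited above; in practice common function words
would have to be removed first, and any cutoff applied to the ratio
would be an arbitrary choice.

```python
import re
from collections import Counter


def repetition_ratio(text):
    """Fraction of all words accounted for by the most frequent word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words)


# Hypothetical page texts.
stuffed = "cheap cheap cheap cheap flights"
ordinary = "we review studies on growth of the internet"
print(repetition_ratio(stuffed), repetition_ratio(ordinary))
```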

Another tool related to metadata is 
annotation. Unlike metadata, which is 
created and attached to Web documents 
by the author for the specific purpose of
aiding indexing, annotations include a
much broader class of data to be at- 
tached to a Web document [Nagao and 
Hasida 1998; Nagao et al. 1999]. Three 
examples of the most common annota- 
tions are linguistic annotation, com- 
mentary (created by persons other than 
the author), and multimedia annota- 
tion. Linguistic annotation is being used 
for automatic summarization and con- 
tent-based retrieval. Commentary anno- 
tation is used to annotate nontextual 
multimedia data, such as image and 
sound data plus some supplementary 
information. Multimedia annotation gen- 
erally refers to text data, which describes 
the contents of video data (which may be 
downloadable from the Web page). An 
interesting example of annotation is the 
attachment of comments on Web docu- 
ments by people other than the document 
author. In addition to aiding indexing and 
retrieval, this kind of annotation may be 
helpful for evaluating documents. 

Despite the promise that metadata 
and annotation could facilitate fast and 
accurate search and retrieval, a recent 
study for the period February 2-28, 
1999 indicates that metatags are only 
used on 34% of homepages, and only 
0.3% of sites use the Dublin Core meta- 
data standard [Lawrence and Giles 
1999a]. Unless a new trend towards the
use of metadata and annotations
develops, their usefulness in information
retrieval may be limited to very large,
closed data owned by large corporations, 
public institutions, and governments 
that choose to use it. 

2.2 Clustering 

Grouping similar documents together to 
expedite information retrieval is known 
as clustering [Anick and Vaithyanathan 
1997; Rasmussen 1992; Sneath and
Sokal 1973; Willett 1988]. During the 
information retrieval and ranking pro- 
cess, two classes of similarity measures 
must be considered: the similarity of a 
document and a query and the similar- 
ity of two documents in a database. The 
similarity of two documents is important
for identifying groups of documents
in a database that can be retrieved and 
processed together for a given type of 
user input query. 

Several important points should be 
considered in the development and im- 
plementation of algorithms for cluster- 
ing documents in very large databases. 
These include identifying relevant at- 
tributes of documents and determining 
appropriate weights for each attribute; 
selecting an appropriate clustering 
method and similarity measure; esti- 
mating limitations on computational 
and memory resources; evaluating the 
reliability and speed of the retrieved 
results; facilitating changes or updates 
in the database, taking into account the 
rate and extent of the changes; and 
selecting an appropriate search algo- 
rithm for retrieval and ranking. This 
final point is of particularly great con- 
cern for Web-based searches. 

There are two main categories of
clustering: hierarchical and
nonhierarchical. Hierarchical methods show greater
promise for enhancing Internet search 
and retrieval systems. Although details 
of clustering algorithms used by major 
search engines are not publicly avail- 
able, some general approaches are 
known. For instance, Digital Equipment 
Corporation's Web search engine Alta- 
Vista is based on clustering. Anick and 
Vaithyanathan [1997] explore how to 
combine results from latent semantic 
indexing (see Section 2.4) and analysis 
of phrases for context-based information 
retrieval on the Web. 

Zamir et al. [1997] developed three 
clustering methods for Web documents. 
In the word-intersection clustering 
method, words that are shared by docu- 
ments are used to produce clusters. The 
method runs in O(n^2) time and produces
good results for Web documents. A second
method, phrase-intersection clustering,
runs in O(n log n) time and is at least two
orders of magnitude faster than methods
that produce comparable clusters. A third
method, called suffix tree clustering, is
detailed in Zamir and Etzioni [1998].
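
The idea of forming clusters from shared words can be sketched as
follows. This is a deliberately simplified single-pass grouping written
for illustration; it is not the word-intersection algorithm of Zamir et
al. [1997], and the function names, the `min_shared` threshold, and the
sample snippets are assumptions.

```python
def word_set(text):
    """Lower-cased set of whitespace-separated words in a document."""
    return set(text.lower().split())


def cluster_by_shared_words(docs, min_shared=2):
    """Assign each document to the cluster it shares the most words with.

    A cluster is described by the words common to all of its members.
    A document that shares fewer than `min_shared` words with every
    existing cluster starts a new one.
    """
    clusters = []  # each item: {"shared": set of words, "members": [indices]}
    for index, text in enumerate(docs):
        words = word_set(text)
        best, best_overlap = None, set()
        for cluster in clusters:
            overlap = cluster["shared"] & words
            if len(overlap) > len(best_overlap):
                best, best_overlap = cluster, overlap
        if best is not None and len(best_overlap) >= min_shared:
            best["members"].append(index)
            best["shared"] = best_overlap
        else:
            clusters.append({"shared": words, "members": [index]})
    return clusters


# Hypothetical snippets returned for a query.
snippets = [
    "web search engine ranking",
    "search engine index coverage",
    "suffix tree string matching",
    "suffix tree clustering",
]
for cluster in cluster_by_shared_words(snippets):
    print(sorted(cluster["shared"]), cluster["members"])
```
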

Modha and Spangler [2000] developed
a clustering method for hypertext docu- 
ments, which uses words contained in 
the document, outlinks from the docu- 
ment, and in-links to the document. 
Clustering is based on six information 
nuggets, which the authors dubbed 
summary, breakthrough, review, key- 
words, citation, and reference. The first 
two are derived from the words in the 
document, the next two from the out- 
links, and the last two from the in-links. 

Several new approaches to clustering 
documents in data mining applications 
have recently been developed. Since 
these methods were specifically de- 
signed for processing very large data 
sets, they may be applicable with some 
modifications to Web-based information 
retrieval systems. Examples of some of 
these techniques are given in Agrawal et 
al. [1998]; Dhillon and Modha [1999; 
2000]; Ester et al. [1995a; 1995b; 1995c]; 
Fisher [1995]; Guha et al. [1998]; Ng and 
Han [1994]; and Zhang et al. [1996]. For 
very large databases, appropriate parallel 
algorithms can speed up computation 
[Omiecinski and Scheuermann 1990]. 

Finally, we note that clustering is just 
one of several ways of organizing docu- 
ments to facilitate retrieval from large
databases. Some alternative methods 
are discussed in Frakes and Baeza- 
Yates [1992]. Examples of methods
designed specifically for facilitating
Web-based information retrieval are
evaluation of significance,
reliability, and topics covered in a set
of Web pages based on analysis of the
hyperlink structures connecting the
pages (see Section 2.4); and
identification of cyber communities with
expertise in subject(s) based on user
access frequency and surfing patterns.

2.3 User Interfaces

Currently, most Web search engines are 
text-based. They display results from 
input queries as long lists of pointers, 
sometimes with and sometimes without 
summaries of retrieved pages. Future 
commercial systems are likely to take 



advantage of small, powerful comput- 
ers, and will probably have a variety of 
mechanisms for querying nontextual 
data (e.g., hand-drawn sketches, tex- 
tures and colors, and speech) and better 
user interfaces to enable users to visu- 
ally manipulate retrieved information 
[Card et al. 1999; Hearst 1997; Maybury 
and Wahlster 1998; Rao et al. 1993; 
Tufte 1983]. Hearst [1999] surveys visu- 
alization interfaces for information re- 
trieval systems, with particular emphasis 
on Web-based systems. A sampling of
exploratory work being conducted in
this area is described below. These
interfaces and their display systems, 
which are known under several different 
names (e.g., dynamic querying, informa- 
tion outlining, visual information seek- 
ing), are being developed at universities, 
government, and private research labs, 
and small venture companies worldwide. 

2.3.1 Metasearch Navigators. A very 
simple tool developed to exploit the best 
features of many search engines is the 
metasearch navigator. These navigators 
allow simultaneous search of a set of 
other navigators. Two of the most
extensive are Search.com <search.com/>,
which can utilize the power of over 250
search engines, and INFOMINE <lib-
www.ucr.edu/enbinfo.html>, which
utilizes over 90. Advanced metasearch
navigators have a single input interface
that sends queries to all (or only
user-selected) search engines, eliminates
duplicates, and then combines and ranks
returned results from the different search 
engines. Some fairly simple examples 
available on the Web are 2ask <web. 
gazeta.pl/miki/search/2ask-anim.html>; 
ALL-IN-ONE <albany.net/allinone/>; 
EZ-Find at The River <theriver.com/ 
theRiver/explore/ezfind.html>; IBM
InfoMarket Service <infomkt.ibm.com/>;
Inference Find <inference.com/infind/>; 
Internet Sleuth <intbc.com/sleuth>; Meta- 
Crawler <metacrawler.cs.washington.edu: 
8080/>; and SavvySearch <cs.colostate.edu/
dreiling/smartform.html> and <guaraldi.
cs.colostate.edu:2000/> [Howe and 
Dreilinger 1997]. 
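
The combine-and-rank step of an advanced metasearch navigator can be
sketched as below. Each engine contributes a ranked list of URLs;
duplicates are collapsed after a crude normalization, and each URL is
scored by the sum of the reciprocals of its ranks. This scoring rule,
the normalization, the engine names, and the URLs are assumptions
chosen for illustration; they do not describe how any of the navigators
listed above actually merge results.

```python
def normalize(url):
    """Crude duplicate key: trim, lower-case, drop a trailing slash."""
    return url.strip().lower().rstrip("/")


def merge_results(result_lists):
    """Merge {engine: ranked URL list} into one duplicate-free ranking."""
    scores = {}
    for urls in result_lists.values():
        seen = set()
        for rank, url in enumerate(urls, start=1):
            key = normalize(url)
            if key in seen:
                continue  # count each URL once per engine
            seen.add(key)
            scores[key] = scores.get(key, 0.0) + 1.0 / rank
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical results from two engines for the same query.
returned = {
    "engine_a": ["http://x.org/", "http://y.org/page"],
    "engine_b": ["http://y.org/page", "http://X.org", "http://z.org"],
}
print(merge_results(returned))
```

A URL returned by several engines accumulates score from each, so
agreement between engines moves a page up the merged list.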

2.3.2 Web-Based Information
Outlining/Visualization. Visualization tools
specifically designed to help users
understand websites (e.g., their directory
structures, types of information avail- 
able) are being developed by many pri- 
vate and public research centers 
[Nielsen 1997]. Overviews of some of 
these tools are given in Ahlberg and 
Shneiderman [1994]; Beaudoin et al. 
[1996]; Bederson and Hollan [1994]; Gloor 
and Dynes [1998]; Lamping et al. [1995]; 
Liechti et al. [1998]; Maarek et al. [1997]; 
Munzner and Burchard [1995]; Robertson 
et al. [1991]; and Tetranet Software Inc.
[1998] <tetranetsoftware.com>. Below
we present some examples of interfaces
designed to facilitate general
information retrieval systems; we then
present some that were specifically
designed to aid retrieval on the Web.

Shneiderman [1994] introduced the 
term dynamic queries to describe inter- 
active user control of visual query pa- 
rameters that generate a rapid, up- 
dated, animated visual display of 
database search results. Some applica- 
tions of the dynamic query concept are 
systems that allow real estate brokers 
and their clients to locate homes based 
on price, number of bedrooms, distance
from work, etc. [Williamson and
Shneiderman 1992]; locate geographical
regions with cancer rates above the
national average [Plaisant 1994]; allow
dynamic querying of a chemistry table
[Ahlberg and Shneiderman 1997]; and
enable users to explore UNIX
directories through dynamic queries
[Liao et al. 1992]. Visual presentation
of query components; visual
presentation of results; rapid,
incremental, and reversible actions;
selection by pointing (not typing); and
immediate and continuous feedback are
features of the systems. Most graphics
hardware systems in the mid-1990's were
still too weak to provide adequate
real-time interaction, but faster
algorithms and advances in hardware
should increase system speed in the
future.

Williams [1984] developed a user
interface for information retrieval
systems to "aid users in formulating a
query." The system, RABBIT III, supports
interactive refinement of queries by
allowing users to critique retrieved
results with labels such as "require" and
"prohibit." Williams claims that this
system is particularly helpful to naive 
users "with only a vague idea of what 
they want and therefore need to be 
guided in the formulation/reformulation 
of their queries . . . (or) who have lim- 
ited knowledge of a given database or 
who must deal with a multitude of data- 
bases." 

Hearst [1995] and Hearst and Peder- 
son [1996] developed a visualization 
system for displaying information about 
a document and its contents, e.g., its 
length, frequency of term sets, and
distribution of term sets within the
document and relative to each other. The system,
called TileBars, displays information 
about a document in the form of a two- 
dimensional rectangular bar with even- 
sized tiles lying next to each other in an 
orderly fashion. Each tile represents 
some feature of the document; the infor- 
mation is encoded as a number whose 
magnitude is represented in grayscale. 

Cutting et al. [1993] developed a sys- 
tem called Scatter/Gather to allow users 
to cluster documents interactively, 
browse the results, select a subset of the 
clusters, and cluster this subset of docu- 
ments. This process allows users to iter- 
atively refine their search. BEAD 
[Chalmers and Chitson 1992]; Galaxy of 
News [Rennison 1994]; and Theme- 
Scapes [Wise et al. 1995] are some of 
the other systems that show graphical 
displays of clustering results. 

Baldonado [1997] and Baldonado and 
Winograd [1997] developed an interface 
for exploring information on the Web 
across heterogeneous sources, e.g.,
search services such as AltaVista,
bibliographic search services such as
Dialog, a map search service, and a video
search service. The system, called
SenseMaker, can "bundle" (i.e., cluster)
similar types of retrieved data according to
user specified "bundling criteria" (the
criteria must be selected from a fixed
menu provided by SenseMaker). Exam- 
ples of available bundling criteria for a 
URL type include "(1) bundling results 
whose URLs refer to the same site; (2) 
bundling results whose URLs refer to 
the same collection at a site; and (3) not 
bundling at all." The system allows us- 
ers to select from several criteria to 
view retrieved results, e.g., according to
the URL, and also allows users to select 
from several criteria on how duplicates 
in retrieved information will be elimi- 
nated. Efficient detection and elimina- 
tion of duplicate database records and 
duplicate retrievals by search engines, 
which are very similar but not necessar- 
ily identical, have been investigated ex- 
tensively by many scientists, e.g., Her- 
nandez [1996]; Hernandez and Stolfo 
[1995]; Hylton [1996]; Monge and Elkan 
[1998]; and Silberschatz et al. [1995]. 
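
The three URL bundling criteria quoted above can be sketched with a
small grouping function. This is our own illustration rather than
SenseMaker's implementation: the function names and sample URLs are
hypothetical, and treating the first path segment as the "collection"
at a site is an assumption.

```python
from urllib.parse import urlsplit


def bundle_key(url, criterion):
    """Return the key under which a result URL is bundled."""
    parts = urlsplit(url)
    site = parts.netloc.lower()
    if criterion == "site":
        return site
    if criterion == "collection":
        # Assumption: the first path segment names the collection.
        first_segment = parts.path.strip("/").split("/")[0]
        return site + "/" + first_segment
    if criterion == "none":
        return url  # every result forms its own bundle
    raise ValueError("unknown bundling criterion: " + criterion)


def bundle(urls, criterion):
    """Group URLs into {key: [urls]} according to the chosen criterion."""
    bundles = {}
    for url in urls:
        bundles.setdefault(bundle_key(url, criterion), []).append(url)
    return bundles


# Hypothetical retrieved results.
results = [
    "http://a.edu/papers/1.html",
    "http://a.edu/papers/2.html",
    "http://a.edu/talks/1.html",
    "http://b.org/index.html",
]
for criterion in ("site", "collection", "none"):
    print(criterion, bundle(results, criterion))
```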

Card et al. [1996] developed two 3D 
virtual interface tools, WebBook and 
WebForager, for browsing and recording 
Web pages. Kobayashi et al. [1999]
developed a system to compare how
relevance rankings of documents differ when
queries are changed. The parallel rank- 
ing system can be used in a variety of 
applications, e.g., query refinement and 
understanding the contents of a data- 
base from different perspectives (each 
query represents a different user per- 
spective). Manber et al. [1997] devel- 
oped WebGlimpse, a tool for simulta- 
neous searching and browsing Web 
pages, which is based on the Glimpse 
search engine. 

Morohashi et al. [1995] and Takeda 
and Nomiyama [1997] developed a sys- 
tem that uses new technologies to orga- 
nize and display, in an easily discern- 
ible form, a massive set of data. The 
system, called "information outlining,"
extracts and analyzes a variety of
features of the data set and interactively
visualizes these features through
corresponding multiple, graphical viewers.
Interactions with multiple viewers
facilitate reducing candidate results,
profiling information, and discovering new
facts. Sakairi [1999] developed a site 



map for visualizing a Web site's struc- 
ture and keywords. 

2.3.3 Acoustical Interfaces. Web- 
based IR contributes to the acceleration 
of studies on and development of more 
user friendly, nonvisual, input-output 
interfaces. Some examples of research 
projects are given in a special journal 
issue on the topic "the next generation 
graphics user interfaces (GUIs)" [CACM 
1993]. An article in Business Week 
[1977] discusses user preference for 
speech-based interfaces, i.e., spoken in- 
put (which relies on speech recognition 
technologies) and spoken output (which 
relies on text-to-speech and speech syn- 
thesis technologies). 

One response to this preference by 
Asakawa [1996] is a method to enable 
the visually impaired to access and use 
the Web interactively, even when Japa- 
nese and English appear on a page (IBM 
Homepage on Systems for the Disabled 
<trl.ibm.co.jp/projects/s7260/sysde.htm>). 
The basic idea is to identify different 
languages (e.g., English, Japanese) and 
different text types (e.g., title and sec- 
tion headers, regular text, hot buttons) 
and then assign persons with easily dis- 
tinguishable voices (e.g., male, female) 
to read each of the different types of text. 
More recently, the method has been ex- 
tended to enable the visually impaired 
to access tables in HTML [Oogane and 
Asakawa 1998]. 

Another solution, developed by Ra- 
man [1996], is a system that enables 
visually impaired users to surf the Web 
interactively. The system, called Emac- 
speak, is much more sophisticated than 
screen readers. It reveals the structure 
of a document (e.g., tables or calendars) 
in addition to reading the text aloud. 

A third acoustic-based approach for 
Web browsing is being investigated by
Mereu and Kazman [1996]. They exam- 
ined how sound environments can be 
used for navigation and found that 
sighted users prefer musical environ- 
ments to enhance conventional means of 
navigation, while the visually impaired 
prefer the use of tones. The components
of all of the systems described above can
be modified for more general systems 
(i.e., not necessarily for the visually im- 
paired) which require an audio/speech- 
based interface. 



2.4 Ranking Algorithms for Web-Based 
Searches 

A variety of techniques have been devel- 
oped for ranking retrieved documents 
for a given input query. In this section 
we give references to some classical 
techniques that can be modified for use 
by Web search engines [Baeza-Yates
and Ribeiro-Neto 1999; Berry and 
Browne 1999; Frakes and Baeza-Yates 
1992]. Techniques developed specifically 
for the Web are also presented. 

Detailed information regarding ranking
algorithms used by major search
engines is not publicly available;
however, it seems that most use term
weighting or variations thereof or
vector space models [Baeza-Yates and Ribeiro-Neto
1999]. In vector space models, each doc- 
ument (in the database under consider- 
ation) is modeled by a vector, each coor- 
dinate of which represents an attribute 
of the document [Salton 1971]. Ideally,
only those attributes that can help to
distinguish documents are incorporated in
the attribute space. In a Boolean model, each
coordinate of the vector is zero (when 
the corresponding attribute is absent) 
or unity (when the corresponding at- 
tribute is present). Many refinements of 
the Boolean model exist. The most com- 
monly used are term-weighting models, 
which take into account the frequency 
of appearance of an attribute (e.g., key- 
word) or location of appearance (e.g., 
keyword in the title, section header, or 
abstract). In the simplest retrieval and
ranking systems, each query is also 
modeled by a vector in the same manner 
as the documents. The ranking of a doc- 
ument with respect to a query is deter- 
mined by its "distance" to the query 
vector. A frequently used yardstick is
the angle defined by a query and document
vector (see footnote 3). Ranking then
requires computing the angle between the
query vector and every document vector,
which is impractical for very large
databases.
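
The ranking scheme just described can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular search engine: the vocabulary, the toy documents, and the helper names `term_vector`, `cosine`, and `rank` are all assumptions made for the example. Documents and the query are mapped to vectors over a fixed vocabulary (Boolean or term-frequency weighted), and documents are ordered by the cosine of the angle they form with the query vector.

```python
import math
from collections import Counter

def term_vector(text, vocabulary, boolean=False):
    """Map a text to a vector over a fixed vocabulary.

    boolean=True gives the Boolean model (0 or 1 per term);
    boolean=False gives simple term-frequency weighting.
    """
    counts = Counter(text.lower().split())
    if boolean:
        return [1.0 if counts[t] > 0 else 0.0 for t in vocabulary]
    return [float(counts[t]) for t in vocabulary]

def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, documents, vocabulary):
    """Return (doc_index, score) pairs, best match first."""
    q = term_vector(query, vocabulary)
    scores = [(i, cosine(q, term_vector(d, vocabulary)))
              for i, d in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy data, invented for illustration.
VOCABULARY = ["web", "search", "engine", "image", "retrieval"]
DOCUMENTS = [
    "web search engine",
    "image retrieval",
    "web image retrieval",
]
RANKING = rank("web search", DOCUMENTS, VOCABULARY)
```

Every query touches every document vector here, which is exactly the cost that makes the direct approach impractical for very large databases.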

One of the more widely used vector 
space model-based algorithms for reduc- 
ing the dimension of the document 
ranking problem is latent semantic in- 
dexing (LSI) [Deerwester et al. 1990]. 
LSI reduces the retrieval and ranking 
problem to one of significantly lower 
dimensions, so that retrieval from very 
large databases can be performed in 
real time. Although a variety of algo- 
rithms based on document vector mod- 
els for clustering to expedite retrieval 
and ranking have been proposed, LSI is 
one of the few that successfully takes 
into account synonymy and polysemy. 
Synonymy refers to the existence of 
equivalent or similar terms, which can 
be used to express an idea or object in 
most languages, and polysemy refers to 
the fact that some words have multiple, 
unrelated meanings. Failing to account
for synonymy leads to many small,
disjoint clusters, some of which should
actually be clustered together, while
failing to account for polysemy can lead
to clustering together of unrelated
documents.

In LSI, documents are modeled by
vectors in the same way as in Salton's
vector space model. We represent the
relationship between the attributes and
documents by an m-by-n (rectangular)
matrix A, with ij-th entry $a_{ij}$, i.e.,

$$A = [a_{ij}].$$
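
As a rough illustration, the sketch below builds a toy instance of the matrix A and applies the SVD-based reduction that the following paragraphs describe. It follows the standard LSI formulation of Deerwester et al. [1990] (truncated SVD, then projection of the query into the reduced space) and assumes NumPy is available. The toy matrix, the term list, and the helper names `lsi_fit`, `project_query`, and `cosine_scores` are assumptions made for the example, not part of the original text.

```python
import numpy as np

# Toy term-document matrix A (m = 5 terms, n = 4 documents), invented
# for illustration. Entry a_ij counts occurrences of term i in document j.
TERMS = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 1, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 0, 1],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

def lsi_fit(A, k):
    """Truncated SVD: keep the k largest singular values and their vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def project_query(q, U_k, s_k):
    """Map a query to a pseudodocument: q_hat = q^T U_k Sigma_k^{-1}."""
    return (q @ U_k) / s_k

def cosine_scores(q_hat, Vt_k):
    """Cosine between the pseudodocument and each document in reduced space."""
    docs = Vt_k.T                       # one row per document, k coordinates
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    norms[norms == 0.0] = 1.0           # avoid division by zero
    return (docs @ q_hat) / norms

U_k, s_k, Vt_k = lsi_fit(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k         # rank-k approximation of A

query = np.array([1, 0, 0, 0, 0], dtype=float)   # the single term "car"
scores = cosine_scores(project_query(query, U_k, s_k), Vt_k)
```

In this toy example the query contains only the term "car", yet document 1, which contains "automobile" and "engine" but not "car", scores highly in the rank-2 space, while the unrelated document 3 scores near zero. This is the synonymy effect discussed above, shown on data chosen to make it visible.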

The column vectors of A represent the
documents in the database. Next, we
compute the singular value decomposition
(SVD) of A (see footnote 4), then
construct a modified matrix $A_k$ from
the k largest singular values
$\sigma_i$, i = 1, 2, ..., k, and their
corresponding vectors, i.e.,

$$A_k = U_k \Sigma_k V_k^T.$$

$\Sigma_k$ is the diagonal matrix of the
k largest singular values $\sigma_i$, and
the columns of $U_k$ and $V_k$ are the
corresponding left and right singular
vectors of A.

A query is processed in two steps:
projection followed by matching. In the
projection step, input queries are mapped
to pseudodocuments in the reduced space
by the matrix $U_k$ and weighted by the
corresponding singular values from
$\Sigma_k$:

$$q \rightarrow \hat{q} = q^T U_k \Sigma_k^{-1},$$

where q represents the original query
vector; $\hat{q}$ the pseudodocument;
$q^T$ the transpose of q; and
$(\cdot)^{-1}$ the inverse operator. In
the matching step, similarities between
the pseudodocument and the documents in
the reduced term-document space are
computed using any one of many similarity
measures, such as the angle defined by
each document and query vector [Salton
1989]. Notable reviews of linear algebra
techniques, including LSI and its
applications to information retrieval,
are Berry et al. [1995] and Letsche and
Berry [1997].

Statistical approaches that can be
extended for use by Web search engines
are reviewed in Crestani et al. [1998]
and Manning and Schutze [1999].

Several scientists have proposed
information retrieval algorithms based on
analysis of hyperlink structures for use
on the Web [Botafogo et al. 1992;
Carriere and Kazman 1997; Chakrabarti et
al. 1988; Chakrabarti et al. 1998; Frisse
1988; Kleinberg 1998; Pirolli et al.
1996; Rivlin et al. 1994].

A simple means to measure the quality of
a Web page, proposed by Carriere and
Kazman [1997], is to count the number of
pages with pointers to the page; it is
used in the WebQuery system and the
Rankdex search engine <rankdex.gari.com>.
Google, which currently indexes about 85
million Web pages, is another search
engine that uses link information. Its
rankings are based, in part, on the
number of other pages with pointers to
the page. This seems to slightly favor
educational and government sites over
commercial ones. In November 1999,
Northern Light introduced a new ranking
system, which is also based, in part, on
link data (Search Engine Briefs
<searchenginewatch.com/sereport/99/11-briefs.html>).

The hyperlink structures are used to rank
retrieved pages, and can also be used for
clustering relevant pages on different
topics. This concept of coreferencing as
a means of discovering so-called
communities of good works was [...] and
White and McCain [1989].

Kleinberg [1998] developed an algorithm
to find the several most information-rich,
or authority, pages for a query. The
algorithm also finds hub pages, i.e.,
pages with links to many authority pages,
and labels the two types of retrieved
pages appropriately.

3 The cosine of the angle between two
vectors is determined by computing the
dot product and dividing by the product
of the $l_2$-norms of the vectors.

4 For details on implementation of the
SVD algorithm, see Demmel [1997]; Golub
and Loan [1996]; and Parlett [1998].

3. FUTURE DIRECTIONS

In this section we present some
promising and imaginative research
endeavors that are likely to make an
impact on Web use in some form or
variation in the future. Knowledge
management [IEEE 1998b].

3.1 Intelligent and Adaptive Web Services

As mentioned earlier, research and de- 
velopment of intelligent agents (also 
known as bots, robots, and aglets) for 
performing specific tasks on the Web 
has become very active [Finin et al. 
1998; IEEE 1996a]. These agents can
tackle problems including finding and 
filtering information; customizing infor- 
mation; and automating completion of 
simple tasks [Gilbert 1997]. The agents 
"gather information or perform some 
other service without (the user's) imme- 
diate presence and on some regular 
schedule" (whatis.com home page
<whatis.com/intellig.htm>). The BotSpot 
home page <botspot.com> summarizes 
and points to some historical informa- 
tion as well as current work on intelli- 
gent agents. The Proceedings of the As- 
sociation for Computing Machinery 
(ACM), see Section 5.1 for the URL; the 
Conferences on Information and Know- 
ledge Management (CIKM); and the 
American Association for Artificial In- 
telligence Workshops <www.aaai.org> 
are valuable information sources. The 
Proceedings of the Practical Applica- 
tions of Intelligent Agents and Multi- 
Agents (PAAM) conference series 
<demon.co.uk/ar/paam96> and
<demon.co.uk/ar/paam97> gives a nice overview
of application areas. The home page of 
the IBM Intelligent Agent Center of 
Competence (IACC) <networking.ibm. 
com/iag/iaghome.html> describes some 
of the company's commercial agent 
products and technologies for the Web. 

Adaptive Web services is one interest- 
ing area in intelligent Web robot re- 
search, including, e.g., Ahoy! The 
Homepage Finder, which performs dy- 
namic reference sifting [Shakes et al. 
1997]; Adaptive Web Sites, which "auto- 
matically improve their organization 
and presentation based on user access 
data" [Etzioni and Weld 1995; Perkowitz 
and Etzioni 1999]; Perkowitz's home 
page <info.cs.vt.edu>; and Adaptive 
Web Page Recommendation Service 
[Balabanovic 1997; Balabanovic and 
Shoham 1998; Balabanovic et al. 1995]. 



Discussion and ratings of some of these 
and other robots are available at several 
Web sites, e.g., Felt and Scales <wsulibs. 
wsu.edu/general/robots.htm> and Mitchell 
[1998]. 

Some scientists have studied proto- 
type metasearchers, i.e., services that 
combine the power of several search en- 
gines to search a broader range of pages 
(since any given search engine covers 
less than 16% of the Web) [Gravano 
1997; Lawrence and Giles 1998a; Sel- 
berg and Etzioni 1995a; 1995b]. Some of 
the better known metasearch engines 
include MetaCrawler, SavvySearch, and
InfoSeek Express. After a query is is- 
sued, metasearchers work in three main 
steps: first, they evaluate which search 
engines are likely to yield valuable, 
fruitful responses to the query; next, 
they submit the query to search engines 
with high ratings; and finally, they 
merge the retrieved results from the 
different search engines used in the pre- 
vious step. Since different search en- 
gines use different algorithms, which 
may not be publicly available, ranking of 
merged results may be a very difficult task. 
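
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the engine names, ratings, and canned results are hypothetical, and the merge step sums reciprocal ranks, a simple fusion rule chosen here because it needs no knowledge of each engine's internal scores. It is not the method of MetaCrawler, SavvySearch, or any other system named in the text.

```python
from collections import defaultdict

def select_engines(query, engine_ratings, top_n=2):
    """Step 1: pick the engines rated most promising for this query.

    engine_ratings maps engine name -> callable(query) -> float; how such
    ratings are obtained is outside the scope of this sketch.
    """
    rated = sorted(engine_ratings,
                   key=lambda name: engine_ratings[name](query),
                   reverse=True)
    return rated[:top_n]

def submit(query, engines, backends):
    """Step 2: send the query to each selected engine.

    backends maps engine name -> callable(query) -> ranked list of URLs.
    """
    return {name: backends[name](query) for name in engines}

def merge(results_by_engine):
    """Step 3: merge ranked lists by summing reciprocal ranks per URL."""
    scores = defaultdict(float)
    for urls in results_by_engine.values():
        for position, url in enumerate(urls, start=1):
            scores[url] += 1.0 / position
    # Highest fused score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

def metasearch(query, engine_ratings, backends, top_n=2):
    engines = select_engines(query, engine_ratings, top_n)
    return merge(submit(query, engines, backends))

# Hypothetical engines with canned results, for illustration only.
RATINGS = {
    "engine_a": lambda q: 0.9,
    "engine_b": lambda q: 0.7,
    "engine_c": lambda q: 0.1,
}
BACKENDS = {
    "engine_a": lambda q: ["x.example/1", "y.example/2", "z.example/3"],
    "engine_b": lambda q: ["y.example/2", "x.example/1"],
    "engine_c": lambda q: ["w.example/9"],
}
MERGED = metasearch("latent semantic indexing", RATINGS, BACKENDS)
```

Because only rank positions are used, the merge works even when the engines' own scoring algorithms are not publicly available, which is the difficulty noted above.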

Scientists have investigated a number 
of approaches to overcome this problem. 
In one system, a result merging condi- 
tion is used by a metasearcher to decide 
how much data will be retrieved from 
each of the search engine results, so 
that the top objects can be extracted 
from search engines without examining 
the entire contents of each candidate 
object [Gravano 1997]. Inquirus downloads
and analyzes individual documents
to take into account factors such
as query term context, identification of 
dead pages and links, and identification 
of duplicate (and near duplicate) pages 
[Lawrence and Giles 1998a]. Document 
ranking is based on the downloaded
document itself, instead of rankings from
individual search engines.
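
The idea of ranking on the downloaded documents themselves can be sketched as below. This is only an illustration of the general approach, not Inquirus's actual algorithm: the scoring rule (fraction of query terms present), the exact-duplicate check after whitespace and case normalization, and the names `rerank`, `fetch`, and `PAGES` are assumptions. Near-duplicate detection, which the text also mentions, is not attempted.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())

def term_context_score(text, query_terms):
    """Fraction of query terms that occur in the downloaded text."""
    words = set(normalize(text).split())
    hits = sum(1 for term in query_terms if term.lower() in words)
    return hits / len(query_terms) if query_terms else 0.0

def rerank(candidates, fetch, query_terms):
    """Re-rank candidate URLs using the downloaded pages themselves.

    fetch(url) returns the page text, or None for a dead link.
    Dead pages and exact duplicates (after normalization) are dropped.
    """
    seen = set()
    scored = []
    for url in candidates:
        text = fetch(url)
        if text is None:          # dead page or broken link
            continue
        key = normalize(text)
        if key in seen:           # duplicate of a page already kept
            continue
        seen.add(key)
        scored.append((term_context_score(text, query_terms), url))
    # Highest score first; sorted() is stable, so ties keep candidate order.
    return [url for score, url in sorted(scored, key=lambda p: -p[0])]

# Hypothetical page store standing in for real downloads.
PAGES = {
    "a.example": "Web search engines rank pages",
    "b.example": "web  search engines RANK pages",   # duplicate of a.example
    "c.example": None,                                # dead link
    "d.example": "Ranking algorithms for Web search engines",
    "e.example": "Flower arranging for beginners",
}
RESULT = rerank(list(PAGES), PAGES.get, ["search", "engines", "ranking"])
```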

3.2 Information Retrieval for Internet 
Shopping 

An intriguing application of Web robot 
technology is in simulation and prediction 
of pricing strategies for sales over the
Internet. The 1999 Christmas and holi- 
day season marked the first time that 
shopping online was no longer a predic- 
tion; "Online sales increased by 300 per- 
cent and the number of orders increased 
by 270 percent" compared to the previ- 
ous year [Clark 2000]. To underscore 
the point, Time magazine selected Jeff 
Bezos, founder of Amazon.com, as 1999
Person of the Year. Exponential growth 
is predicted in online shopping. Charts 
that illustrate projected growth in In- 
ternet-generated revenue, Internet-re- 
lated consumer spending, Web advertis- 
ing revenue, etc. from the present to 
2002, 2003, and 2005 are given in Nua's 
survey pages (see Section 1.2 for the 
URL). 

Robots to help consumers shop, or 
shopbots, have become commonplace in 
e-commerce sites and general-purpose 
Web portals. Shopbot technology has 
taken enormous strides since its initial 
introduction in 1995 by Andersen
Consulting. This first bot, known as
Bargain Finder, helped consumers find the
lowest priced CDs. Many current shop- 
bots are capable of a host of other tasks 
in addition to comparing prices, such as 
comparing product features, user re- 
views, delivery options, and warranty 
information. Clark [2000] reviews the
state of the art in bot technology and
presents some predictions for the future
by experts in the field. For example,
Kephart, manager of IBM's Agents
and Emergent Phenomena Group, predicts
that "shopping bots may soon be
able to negotiate and otherwise work
with vendor bots, interacting via
ontologies and distributed technologies... bots
would then become 'economic actors
making decisions.'" Guttman, chief
technology officer for Frictionless
Commerce <frictionless.com>, notes that
Frictionless's bot engine is used by some
famous portals, including Lycos, and
mentions that his company's technology
will be used in a retailer bot that will
"negotiate trade-offs between product
price, performance, and delivery times
with shopbots on the basis of customer
preferences." Price comparison robots
and their possible roles in Internet mer- 
chant price wars in the future are dis- 
cussed in Kephart et al. [1998a; 1998b]. 
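
The kind of trade-off between price, delivery time, and other attributes mentioned above can be sketched as a weighted comparison of offers. This is a toy illustration only: the `Offer` fields, the weights, the scaling rule, and the vendor data are all assumptions, and no claim is made that any shopbot named in the text works this way.

```python
from dataclasses import dataclass

@dataclass
class Offer:
    vendor: str
    price: float          # in some currency unit
    delivery_days: int
    rating: float         # product or vendor rating on a 0-5 scale

def best_offer(offers, weights):
    """Pick the offer with the lowest weighted cost.

    Price and delivery time are scaled by the worst value among the
    offers; a higher rating lowers the cost.
    """
    max_price = max(o.price for o in offers) or 1.0
    max_days = max(o.delivery_days for o in offers) or 1

    def cost(o):
        return (weights["price"] * o.price / max_price
                + weights["delivery"] * o.delivery_days / max_days
                - weights["rating"] * o.rating / 5.0)

    return min(offers, key=cost)

# Hypothetical offers for the same product.
OFFERS = [
    Offer("vendor_a", price=12.0, delivery_days=7, rating=4.0),
    Offer("vendor_b", price=15.0, delivery_days=2, rating=4.5),
    Offer("vendor_c", price=11.0, delivery_days=14, rating=3.0),
]
PRICE_FIRST = best_offer(OFFERS, {"price": 1.0, "delivery": 0.0, "rating": 0.0})
IN_A_HURRY = best_offer(OFFERS, {"price": 0.2, "delivery": 1.0, "rating": 0.2})
```

With price as the only criterion the cheapest vendor wins, as in the earliest price-comparison bots; shifting weight toward delivery time changes the choice, which is the sense in which the result depends on customer preferences.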

The auction site is another successful 
technological off-shoot of the Internet 
shopping business [Cohen 2000; Ferguson 
2000]. Two of the more famous general 
online auction sites are priceline.com 
<priceline.com> and eBay <ebay.com>. 
Priceline.com pioneered and patented 
its business concept, i.e., online bidding 
[Walker et al. 1997]. Patents related to 
that of priceline.com include those 
owned by ADT Automotive, Inc. [Berent
et al. 1998]; Walker Asset Management 
[Walker et al. 1996]; and two individu- 
als [Barzilai and Davidson 1997]. 

3.3 Multimedia Retrieval 

IR from multimedia databases is a mul- 
tidisciplinary research area, which in- 
cludes topics from a very diverse range, 
such as analysis of text, image and 
video, speech, and nonspeech audio; 
graphics; animation; artificial intelli- 
gence; human-computer interaction; 
and multimedia computing [Faloutsos 
1996; Faloutsos and Lin 1995; Maybury 
1997; and Schauble 1997]. Recently, 
several commercial systems that inte- 
grate search capabilities from multiple 
databases containing heterogeneous, 
multimedia data have become available. 
Examples include PLS <pls.com>; 
Lexis-Nexis <lexis-nexis.com>; DIALOG 
<dialog.com>; and Verity <verity.com>. 
In this section we point to some recent 
developments in the field; but the dis- 
cussion is by no means comprehensive. 

Query and retrieval of images is one 
of the more established fields of re- 
search involving multimedia databases 
[IEEE ICIP: Proceedings of the IEEE
International Conference on Image Processing;
IEEE ICASSP: Proceedings
of the IEEE International Conference on
Acoustics, Speech and Signal Processing;
and IFIP 1992]. So much work by so
many has been conducted on this topic 
that a comprehensive review is beyond 
the scope of this paper. But some selected
work in this area follows: search
and retrieval from large image archives 
[Castelli et al. 1998]; pictorial queries 
by image similarity [Soffer and Samet]; 
image queries using Gabor wavelet fea- 
tures [Manjunath and Ma 1996]; fast, 
multiresolution image queries using 
Haar wavelet transform coefficients [Ja- 
cobs et al. 1995]; acquisition, storage, 
indexing, and retrieval of map images 
[Samet and Soffer 1986]; real-time fin- 
gerprint matching from a very large da- 
tabase [Ratha et al. 1992]; querying and 
retrieval using partially decoded JPEG 
data and keys [Schneier and Abdel-Mot- 
taleb 1996]; and retrieval of faces from a 
database [Bach et al. 1993; Wu and 
Narasimhalu 1994]. 

Finding documents that have images 
of interest is a much more sophisticated 
problem. Two well-known portals with a 
search interface for a database of im- 
ages are the Yahoo! Image Surfer 
<isurf.yahoo.com> and the Alta Vista 
PhotoFinder <image.altavista.com>. Like
Yahoo!'s text-based search engine, the
Image Surfer home pages are organized 
into categories. For a text-based query, 
a maximum of six thumbnails of the 
top-ranked retrieved images are dis- 
played at a time, along with their titles. 
If more than six are retrieved, then 
links to subsequent pages with lower 
relevance rankings appear at the bot- 
tom of the page. The number of entries 
in the database seems to be small; we
attempted to retrieve photos of some 
famous movie stars and came up with 
none (for Brad Pitt) or few retrievals 
(for Gwyneth Paltrow), some of which 
were outdated or unrelated links. The 
input interface to Photofinder looks 
very much like the interface for Alta 
Vista's text-based search engine. For a 
text-based query, a maximum of twelve 
thumbnails of retrieved images are displayed
at a time. Only the name of the
image file is displayed, e.g., image.jpg. 
To read the description of an image (if it 
is given), the mouse must point to the 
corresponding thumbnail. The number 
of retrievals for Photofinder was huge
(4232 for Brad Pitt and 119 for Gwyneth 



Paltrow), but there was a considerable 
amount of noise after the first page of 
retrievals and there were many redun- 
dancies. Other search engines with an 
option for searching for images in their 
advanced search page are Lycos, Hot- 
Bot, and AltaVista. All did somewhat 
better than Photofinder in retrieving 
many images of Brad Pitt and Gwyneth 
Paltrow; most of the thumbnails were 
relevant for the first several pages (each 
page contained 10 thumbnails). 

NEC's Inquirus is an image search 
engine that uses results from several 
search engines. It analyzes the text
accompanying images to determine
relevance for ranking, and downloads the
actual images to create thumbnails that 
are displayed to the user [Lawrence and 
Giles 1999c]. 

Query and retrieval of images in a 
video frame or frames is a research area 
closely related to retrieval of still im- 
ages from a very large image database 
[Bolle et al. 1998]. We mention a few
studies to illustrate the potentially
wide scope of applications, e.g.,
content-based video indexing and
retrieval [Smoliar and Zhang
1994]; the Query-by-Image-Content 
(QBIC) system, which helps users find 
still images in large image and video 
databases on the basis of color, shape, 
texture, and sketches [Flickner et al. 
1997; Niblack 1993]; Information Navi- 
gation System (INS) for multimedia 
data, a system for archiving and search- 
ing huge volumes of video data via Web 
browsers [Nomiyama et al. 1997];
VisualSEEk, a tool for searching, brows- 
ing, and retrieving images, which allows 
users to query for images using the vi- 
sual properties of regions and their spa- 
tial layout [Smith and Chang 1997a; 
1996]; compressed domain image
manipulation and feature extraction for
compressed domain image and video
indexing and searching [Chang 1995;
Zhong and Chang 1997]; a method for
extracting visual events from relatively
long videos using objects (rather than
keywords), with specific applications to
sports events [Iwai et al. 2000;
Kurokawa et al. 1999]; retrieval and semantic
interpretation of video contents based
on objects and their behavior [Echigo et 
al. 2000]; shape-based retrieval and its 
application to identity checks on fish 
[Schatz 1997]; and searching for images 
and videos on the Web [Smith and 
Chang 1997b]. 

Multilingual communication on the 
Web [Miyahara et al. 2000] and
cross-language document retrieval are timely
research topics being investigated by
many [Ballesteros and Croft 1998; Eich- 
mann et al. 1998; Pirkola 1998]. An 
introduction to the subject is given in 
Oard [1997b], and some surveys are 
found in CLIR [1999] (Cross-Language 
Information Retrieval Project
<clis.umd.edu/dlrg>); Oard [1997a]
<glue.umd.edu/oard/research.html>; and in Oard
and Dorr [1996]. Several search engines
now feature multilingual search, e.g., 
Open Text Web Index <index.opentext. 
net> searches in four languages (En- 
glish, Japanese, Spanish, and Portu- 
guese). A number of commercial 
Japanese-to-English and English-to- 
Japanese Web translation software 
products have been developed by lead- 
ing Japanese companies in Japanese 
<bekkoame.ne.jp/oto3>. A typical ex- 
ample, which has a trial version for 
downloading, is a product called Hon- 
yaku no Oosama <ibm.co.jp/software/ 
internet/king/index.html>, or Internet 
King of Translation [Watanabe and 
Takeda 1998]. 

Other interesting research topics and 
applications in multimedia IR are 
speech-based IR for digital libraries 
[Oard 1997c] and retrieval of songs from 
a database when a user hums the first 
few bars of a tune [Kageyama and 
Takashima 1994]. The melody retrieval 
technology has been incorporated as an 
interface in a karaoke machine. 

3.4 Conclusions 

Potentially lucrative application of In- 
ternet-based IR is a widely studied and 
hotly debated topic. Some pessimists be- 
lieve that current rates of increase in 
the use of the Internet, number of Web 



sites and hosts are not sustainable, so 
that research and business opportuni- 
ties in the area will decline. They cite 
statistics such as the April 1998 GVU 
WWW survey, which states that the use 
of better equipment (e.g., upgrades in 
modems by 48% of people using the 
Web) has not resolved the problem of 
slow access, and an August 1998 survey 
by Alexa Internet stating that 90% of all 
Web traffic is spread over 100,000 dif- 
ferent hosts, with 50% of all Web traffic 
headed towards the top 900 most popu- 
lar sites. In short, the pessimists main- 
tain that an effective means of manag- 
ing the highly uneven concentration of 
information packets on the Internet is 
not immediately available, nor will it be 
in the near future. Furthermore, they 
note that the exponential increase in 
Web sites and information on the Web 
is contributing to the second most com- 
monly cited problem, that is, users not 
being able to find the information they 
seek in a simple and timely manner. 

The vast majority of publications, 
however, support a very optimistic view. 
The visions and research projects of 
many talented scientists point towards 
finding concrete solutions and building 
more efficient and user-friendly solu- 
tions. For example, McKnight and 
Boroumand [2000] maintain that flat 
rate Internet retail pricing, currently
the predominant pricing model in the
U.S., may be one of the major culprits
in the traffic-congestion problem, and 
they suggest that other pricing models 
are being proposed by researchers. It is 
likely that the better proposals will be 
seriously considered by the business 
community and governments to avoid 
the continuation of the current solution, 
i.e., overprovisioning of bandwidth. 